Information Extraction
HunyuanOCR Technical Report
Hunyuan Vision Team: Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Xinsong Zhang, Jinnian Zhang, Houwen Peng, Hongming Yang, Senhao Xie, Longsha Zhou, Ge Pei, Binghong Wu, Rui Yan, Kan Wu, Jieneng Yang, Bochao Wang, Kai Liu, Jianchen Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang
This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.
- Workflow (1.00)
- Research Report (1.00)
- Education (1.00)
- Media (0.67)
- Health & Medicine (0.67)
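The ViT, MLP adapter, and LLM wiring the HunyuanOCR abstract describes can be sketched in a few lines. All dimensions, layer shapes, and function names below are illustrative placeholders, not HunyuanOCR's actual configuration:

```python
import numpy as np

# Minimal sketch of a ViT -> MLP adapter -> LLM connection.
# Sizes are hypothetical, chosen only to make the shapes visible.
rng = np.random.default_rng(0)
VIT_DIM, LLM_DIM, N_PATCHES = 32, 64, 16

def vit_encode(image_patches):
    """Stand-in for the vision encoder: one linear projection per patch."""
    W = rng.normal(size=(image_patches.shape[-1], VIT_DIM))
    return image_patches @ W                    # (n_patches, VIT_DIM)

def mlp_adapter(vision_tokens):
    """Two-layer MLP mapping vision features into the LLM embedding space."""
    W1 = rng.normal(size=(VIT_DIM, LLM_DIM))
    W2 = rng.normal(size=(LLM_DIM, LLM_DIM))
    h = np.maximum(vision_tokens @ W1, 0.0)     # ReLU
    return h @ W2                               # (n_patches, LLM_DIM)

patches = rng.normal(size=(N_PATCHES, 48))      # flattened image patches
vision_tokens = vit_encode(patches)
llm_tokens = mlp_adapter(vision_tokens)         # ready to prepend to text tokens
print(llm_tokens.shape)                         # (16, 64)
```

The adapter's only job is dimension matching: the resulting `llm_tokens` are concatenated with text embeddings and fed to the language model as ordinary tokens.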
Neurosymbolic Information Extraction from Transactional Documents
Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jean-Marc Ogier
This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in F1-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
- Europe > France (0.04)
- North America > United States > California > Los Angeles County > Santa Monica (0.04)
- Europe > Greece > Attica > Athens (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Data Science > Data Mining > Text Mining (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
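The syntactic-, task-, and domain-level validation cascade the abstract describes can be sketched as a filter over candidate extractions. This is a minimal sketch under a simplified receipt schema; the field names and the single arithmetic constraint (subtotal + tax = total) are assumptions, and the paper's actual schema is richer:

```python
from decimal import Decimal

def syntactic_ok(c):
    """Syntactic level: every amount field parses as a decimal number."""
    try:
        for k in ("subtotal", "tax", "total"):
            Decimal(c[k])
        return True
    except (KeyError, ArithmeticError):  # decimal.InvalidOperation is an ArithmeticError
        return False

def task_ok(c):
    """Task level: all required fields are present and non-empty."""
    return all(c.get(k) for k in ("subtotal", "tax", "total"))

def domain_ok(c):
    """Domain level: transactional arithmetic must hold exactly."""
    return Decimal(c["subtotal"]) + Decimal(c["tax"]) == Decimal(c["total"])

def validate(candidates):
    """Keep only candidates passing all three levels, cheapest checks first."""
    return [c for c in candidates if syntactic_ok(c) and task_ok(c) and domain_ok(c)]

candidates = [
    {"subtotal": "100.00", "tax": "19.00", "total": "119.00"},  # consistent
    {"subtotal": "100.00", "tax": "19.00", "total": "129.00"},  # fails arithmetic
    {"subtotal": "abc",    "tax": "19.00", "total": "119.00"},  # fails parsing
]
print(len(validate(candidates)))  # 1
```

The cascade ordering matters in practice: syntactic checks are cheap and reject most malformed model outputs before the domain-level arithmetic runs.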
ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat, Natapong Nitarach, Chanakan Wittayasakpan, Warit Sirichotedumrong, Adisai Na-Thalang, Kunat Pipatanakul
We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and offers actionable insights for improving Thai-language document understanding.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (3 more...)
Information Extraction From Fiscal Documents Using LLMs
Vikram Aggarwal, Jay Kulkarni, Aditi Mascarenhas, Aakriti Narang, Siddarth Raman, Ajay Shah, Susan Thomas
Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to annual fiscal documents from the State of Karnataka in India (200+ pages), our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. A major challenge with traditional OCR methods is the inability to verify the accurate extraction of numbers. In fiscal data, the inherent structure of fiscal tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level validation checks. We demonstrate that LLMs can read tables and also process document-specific structural hierarchies, offering a scalable process for converting PDF-based fiscal disclosures into research-ready databases. Our implementation shows promise for broader applications across developing country contexts.
- Asia > India > Karnataka (0.27)
- Asia > Singapore (0.07)
- Asia > India > Maharashtra > Mumbai (0.05)
- (2 more...)
- Research Report (1.00)
- Overview > Innovation (0.34)
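The multi-level validation idea from the fiscal-documents abstract can be sketched as a recursive sum check: each parent line's stated amount should equal the sum of its children, so OCR or extraction errors surface as mismatches. The budget tree below is hypothetical, not taken from the Karnataka data:

```python
def check_totals(node, tolerance=0.01):
    """Recursively verify that every parent's amount equals its children's sum."""
    errors = []
    children = node.get("children", [])
    if children:
        child_sum = sum(c["amount"] for c in children)
        if abs(child_sum - node["amount"]) > tolerance:
            errors.append((node["name"], node["amount"], child_sum))
        for c in children:
            errors.extend(check_totals(c, tolerance))
    return errors

# Illustrative budget tree with one deliberately misread amount.
budget = {
    "name": "Department Total", "amount": 500.0, "children": [
        {"name": "Salaries", "amount": 300.0},
        {"name": "Capital", "amount": 210.0,  # misread: children sum to 200.0
         "children": [{"name": "Buildings", "amount": 150.0},
                      {"name": "Equipment", "amount": 50.0}]},
    ],
}
for name, stated, computed in check_totals(budget):
    print(f"{name}: stated {stated}, children sum to {computed}")
```

Note that a single misread leaf can flag mismatches at every ancestor level, which is what makes these checks useful for localizing extraction errors.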
Supervised Fine-Tuning of Large Language Models for Domain-Specific Knowledge Graph Construction: A Case Study on Hunan's Historical Celebrities
Junjie Hao, Chun Wang, Ying Qiao, Qiuyue Zuo, Qiya Song, Hua Ma, Xieping Gao
Large language models and knowledge graphs hold broad application potential in the field of historical culture, facilitating the excavation, research, and comprehension of cultural heritage. Taking the historical celebrities of Hunan who emerged from modern Huxiang culture as a case study, pre-trained large models can assist researchers in rapidly extracting specific historical figure information from literature--including basic details, life events, and social relationships--and constructing structured knowledge graphs, thereby supporting related research. Currently, systematic data collection on Hunan's historical celebrities remains scarce. Moreover, general-purpose large language models often exhibit insufficient domain-knowledge extraction accuracy and weak structured-output capabilities in such low-resource scenarios. Therefore, this paper proposes a supervised fine-tuning approach for domain-specific large models to enhance the quality and efficiency of information extraction regarding Hunan's historical celebrities. Specifically, this paper first designs a fine-grained, schema-guided instruction fine-tuning template for the Hunan historical celebrities domain. Using this template, we construct an instruction fine-tuning dataset, addressing the current lack of instruction datasets in domain-specific model fine-tuning. Second, we conducted parameter-efficient instruction fine-tuning on four publicly available large language models--Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct--using the proposed instruction dataset, and established evaluation criteria for assessing their performance in character information extraction. Experimental results demonstrate that the performance of all four base models significantly improved after domain-specific fine-tuning. Among them, Qwen3-8B achieved the best performance after training with 100 samples and 50 fine-tuning iterations, scoring 89.3866 on the evaluation metrics.
This research offers new insights for fine-tuning vertical large models tailored to regional historical and cultural domains, holding significant implications for promoting the cost-effective application of large models and knowledge graphs in the field of historical and cultural heritage.
- Asia > China > Hunan Province (0.14)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- (2 more...)
- Health & Medicine (0.46)
- Government (0.46)
- Information Technology (0.46)
A Reasoning Paradigm for Named Entity Recognition
Hui Huang, Yanping Chen, Ruizhang Huang, Chuan Lin, Yongbin Qin
Generative LLMs typically improve Named Entity Recognition (NER) performance through instruction tuning. They excel at generating entities by semantic pattern matching but lack an explicit, verifiable reasoning mechanism. This "cognitive shortcutting" leads to suboptimal performance and brittle generalization, especially in zero-shot and low-resource scenarios where reasoning from limited contextual cues is crucial. To address this issue, a reasoning framework, ReasoningNER, is proposed for NER, which shifts the extraction paradigm from implicit pattern matching to explicit reasoning. This framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. First, a dataset annotated with NER-oriented CoTs is generated, containing task-relevant reasoning chains. Then, these are used to tune the NER model to generate coherent rationales before deriving the final answer. Finally, a reasoning enhancement stage is implemented to optimize the reasoning process using a comprehensive reward signal, ensuring explicit and verifiable extractions. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance. In zero-shot settings, it achieves state-of-the-art (SOTA) performance, outperforming GPT-4 by 12.3 percentage points on F1 score. Analytical results also demonstrate its great potential to advance research in reasoning-oriented information extraction. Our code is available at https://github.com/HuiResearch/ReasoningIE.
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
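The "rationale before answer" format and a verifiable extraction check in the spirit of the reward signal can be sketched as follows. The tag names, answer format, and example sentence are assumptions for illustration, not the paper's actual protocol:

```python
import re

sentence = "Barack Obama visited Paris in 2015."

# Hypothetical model output: a reasoning chain, then a structured answer.
model_output = """<think>
'Barack Obama' is a person name; 'Paris' follows 'visited', so a location.
</think>
<answer>PER: Barack Obama | LOC: Paris</answer>"""

def parse_answer(output):
    """Pull (type, span) pairs out of the <answer> block."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.S)
    pairs = []
    for item in m.group(1).split("|"):
        etype, span = item.split(":", 1)
        pairs.append((etype.strip(), span.strip()))
    return pairs

def verifiable(pairs, text):
    """Reward-style check: every predicted span must occur verbatim in the text."""
    return all(span in text for _, span in pairs)

entities = parse_answer(model_output)
print(entities)                         # [('PER', 'Barack Obama'), ('LOC', 'Paris')]
print(verifiable(entities, sentence))   # True
```

A span-occurrence check like this is the simplest verifiable signal for extraction tasks: unlike fluency-based rewards, it can be computed exactly and penalizes hallucinated entities directly.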
Scaling Open-Weight Large Language Models for Hydropower Regulatory Information Extraction: A Systematic Analysis
Hong-Jun Yoon, Faisal Ashraf, Thomas A. Ruggles, Debjani Singh
Information extraction from regulatory documents using large language models presents critical trade-offs between performance and computational resources. We evaluated seven open-weight models (0.6B-70B parameters) on hydropower licensing documentation to provide empirical deployment guidance. Our analysis identified a pronounced 14B-parameter threshold at which validation methods transition from ineffective (F1 < 0.15) to viable (F1 = 0.64). Consumer-deployable models achieve 64% F1 through appropriate validation, while smaller models plateau at 51%. Large-scale models approach 77% F1 but require enterprise infrastructure. We identified systematic hallucination patterns in which perfect recall indicates extraction failure rather than success in smaller models. Our findings establish the first comprehensive resource-performance mapping for open-weight information extraction in regulatory contexts, enabling evidence-based model selection. These results provide immediate value for hydropower compliance while contributing insights into parameter scaling effects that generalize across information extraction tasks.
- North America > United States > Tennessee > Anderson County > Oak Ridge (0.04)
- North America > United States > Montana > Roosevelt County (0.04)
- North America > Canada > Alberta (0.04)
- (2 more...)
- Government > Regional Government > North America Government > United States Government (1.00)
- Energy > Renewable (1.00)
- Energy > Power Industry (0.90)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing
Maoqi Liu, Quan Fang, Yang Yang, Can Zhao, Kaiquan Cai
Notice to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface-level tasks such as classification and named entity recognition, lacking deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference-rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high-quality dataset of 12,347 expert-annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi-agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt-engineering strategies and model-adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: https://github.com/Estrellajer/Knots.
- Asia > China > Beijing > Beijing (0.05)
- North America > United States > Mississippi > Marion County (0.04)
- North America > United States > Louisiana (0.04)
- (5 more...)
- Transportation > Air (1.00)
- Transportation > Infrastructure & Services (0.93)
- Asia > Singapore (0.04)
- Asia > China > Hubei Province > Wuhan (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)